In this post, we will analyze a synthetic dataset from Kaggle containing information on about 15,000 employees of a company: their satisfaction level, number of projects, seniority, and other employment metrics, along with a binary variable indicating whether they left the company. The goal of this analysis is to visualize multivariate relationships in the data that may explain why employees leave, and to apply different modeling techniques to predict as accurately as possible whether an employee will leave. First, let’s import the data and look at the structure of the variables. Several categorical variables are coded as numerics, so I’ll convert them all to factors to help with visualization and modeling.
set.seed(1234) #for reproducibility
library(dplyr)
hr <- read.csv("/home/evan/Downloads/HR_comma_sep.csv")
glimpse(hr)
## Observations: 14,999
## Variables: 10
## $ satisfaction_level <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sales <fctr> sales, sales, sales, sales, sales, sale...
## $ salary <fctr> low, medium, medium, low, low, low, low...
hr[,3] <- as.factor(hr[,3])
hr[,5] <- as.factor(hr[,5])
hr[,6] <- as.factor(hr[,6])
hr[,7] <- as.factor(hr[,7])
hr[,8] <- as.factor(hr[,8])
With that taken care of, we can begin visualizing relationships in the data, first starting with distributions of each variable through histograms and barplots. The plotly package makes it easy to include interactive plots and also integrates well with ggplot2, making it an effective choice for visualization.
library(ggplot2)
library(plotly)
p <- ggplot(data = hr)
ggplotly(p + geom_bar(aes(x = number_project)) + labs(x="Number of Projects"))
ggplotly(p + geom_histogram(aes(x = satisfaction_level)) + labs(x = "Satisfaction Level"))
ggplotly(p + geom_histogram(aes(x = last_evaluation))+ labs(x = "Last Evaluation Score"))
ggplotly(p + geom_bar(aes(x = as.factor(time_spend_company))) + labs(x = "Years Spent at Company"))
ggplotly(p + geom_histogram(aes(x = average_montly_hours)) + labs(x = "Average Monthly Hours"))
ggplotly(p + geom_bar(aes(x = as.factor(salary))) + labs(x="Salary"))
ggplotly(p + geom_bar(aes(x = as.factor(sales))) + labs(x="Department"))
We can see that most employees have 3-4 projects and have been with the company for less than three years, with few making high salaries, corresponding to the high number of sales jobs represented in the dataset. There are two large groups of employees who work ~150 hours a month and ~250 hours a month, and in general they display a wide range of satisfaction levels, though slightly more positive than negative.
library(GGally)
ggpairs(hr, columns=c("satisfaction_level", "number_project","time_spend_company"))
The GGally package also allows us to easily visualize multivariate relationships in the data. We can start to see some specific groups with low satisfaction levels (likely a valuable predictor of whether an employee leaves), including those with 2, 6, or 7 projects and those who have been at the company for three years. Let’s see if we can illuminate these relationships further with more plots. Focusing on just the employees who left makes the graphs more readable while still retaining the general trends of the rest of the data. Let’s subset the dataset and plot satisfaction level against average monthly hours, coloring the points first by number of projects and then by seniority.
left <- subset(hr, left == 1)
names(left) <- c("satisfac", "eval", "proj", "hours", "years","accident","left","promote", "job", "salary")
ggplotly(ggplot(left, aes(x = hours, y = satisfac)) +
geom_jitter (aes(color = proj)) + labs(title = "Employee Satisfaction vs. Hours Worked", x = "Hours per Month", y = "Satisfaction Level"))
ggplotly(ggplot(left, aes(x = hours, y = satisfac)) +
geom_jitter (aes(color = years)) + labs(title = "Employee Satisfaction vs. Hours Worked", x = "Hours per Month", y = "Satisfaction Level"))
We can see many of those who left belong to three distinct categories:
Those who work ~150 hours a month, have been with the company for 3 years, have only 2 projects, and were moderately unsatisfied with their jobs (group 1, in the middle left);
Those who work 250-300 hours a month, have been with the company for 4 or 5 years, have 6-7 projects, and were extremely unsatisfied with their jobs (group 2, in the bottom right);
and those who work 175-275 hours a month, have been with the company for 4-6 years, have 4-5 projects, and were very satisfied with their jobs (group 3, in the top center).
The plots also indicate that none of the company’s most senior employees (7-10 years) left.
These groups persist in the next plot, which shows that group 1 scored relatively low on their last evaluation while groups 2 and 3 both scored quite high. It also shows the near-total lack of promotions among those who left, perhaps pointing to a lack of upward mobility in the company as a reason for leaving.
ggplotly(ggplot(left, aes(x = eval, y = satisfac)) + geom_jitter(aes(col = promote)) + labs(title = "Employee Satisfaction from Last Evaluation", x = "Last Evaluation Score", y = "Satisfaction Level"))
If we come back to the full dataset, we can try another plot to visualize those who left, which confirms that the three groups exist in years 3, 4, and 5 of seniority.
ggplotly(ggplot(hr, aes(x = time_spend_company, y = average_montly_hours, col = left)) + geom_jitter() + labs(title = "Time Spent at Company vs. Average Hours Worked", x = "Years at Company", y = "Monthly Hours Worked"))
My intuition is that group 1 represents employees who were underutilized by the company: their low number of projects led to boredom, mild dissatisfaction with their jobs, and, judging by their low evaluation scores, a perceived lack of value to the company. Group 2 consists of employees being overutilized by the company, with incredibly long work weeks and a high number of simultaneous projects, causing very low satisfaction but apparently good results and high evaluation scores. Group 3, being both highly satisfied and well rated on their last evaluation, seem to thrive under the pressure of juggling several projects and long work weeks, and likely left for reasons outside the company’s control, such as a spouse relocating or a more attractive job offer. We will refer to these groups later when categorizing the data.
As an alternate method of visualization, we can use a circle plot to represent relationships between continuous variables. In this case, we can use satisfaction level and last evaluation on the x and y axes and average monthly hours worked as the size of the circle. Observe the same general clusters of employees who left (represented by ones) that persist in the plot.
hr1 <- sample_n(hr, size = 200, replace = F)
radius <- sqrt(hr1$average_montly_hours/ pi)
xVar<-hr1$last_evaluation
yVar<-hr1$satisfaction_level
labelVar<-hr1$left
symbols(xVar, yVar, circles=radius, inches=0.20, fg="white", bg="red", xlab="Last Evaluation Score", ylab="Satisfaction Level")
text(xVar, yVar, labelVar, cex = 0.75)
We can also try using heatmaps with satisfaction level as the z-variable (or “heat” value for each cell of the map) using different x and y variables. The plots allow us to get a good sense of the distribution of employee satisfaction through the company and visualize the dataset’s clusters in a new way.
x <- list(
title = "Years at Company"
)
y <- list(
title = "Average Monthly Hours"
)
plot_ly(z = hr$satisfaction_level, x = hr$time_spend_company, y = hr$average_montly_hours, type = "heatmap") %>% layout(title = "Years Worked/Hours Monthly Worked by Satisfaction Level", xaxis = x, yaxis = y)
x <- list(
title = "Number of Projects"
)
y <- list(
title = "Average Monthly Hours"
)
plot_ly(z = hr$satisfaction_level, x = hr$number_project, y = hr$average_montly_hours, type = "heatmap") %>% layout(title = "Number of Projects/Hours Monthly Worked by Satisfaction Level", xaxis = x, yaxis = y)
x <- list(
title = "Left"
)
y <- list(
title = "Average Monthly Hours"
)
plot_ly(z = hr$satisfaction_level, x = hr$left, y = hr$average_montly_hours, type = "heatmap") %>% layout(title = "Left/Hours Monthly Worked by Satisfaction Level", xaxis = x, yaxis = y)
The graphs again point to those with 6-7 projects being heavily worked and very unsatisfied, many of them having been employees for 4 years, as well as the group of 3rd-year employees who are underworked and also unsatisfied. We can also identify group 1 as the light blue cluster and group 2 as the purple cluster in the first two graphs, and observe the noticeable increase in hours worked among those who left compared to those who didn’t, which is shown even more clearly in the following boxplot.
ggplot(hr) + geom_boxplot(aes(x=left, y=average_montly_hours)) + labs(x = "Left", y="Average Monthly Hours")
Now that we have a better picture of what may be driving employee churn in the dataset, let’s attempt to predict whether an employee left based on their other characteristics. After creating training and test sets, we will first use linear regression to predict an employee’s satisfaction level from every variable besides left. While the previous plots suggest the relationships between satisfaction level and the other continuous variables are too heavily clustered to be truly linear, this offers a simple first approach to modeling the data, and factored variables like number of projects and seniority may be more associated with satisfaction level than we initially observe.
data.size<-nrow(hr)
train.size<-0.80
train.row.nums<-sample(1:data.size, data.size*train.size, replace=FALSE)
train.data<-subset(hr[train.row.nums,])
test.row.nums<-setdiff(1:data.size,train.row.nums)
test.data<-subset(hr[test.row.nums,])
true.labels<-test.data[,7]
We can initially fit the model with every variable, then use stepwise elimination to remove non-significant variables based on the Akaike information criterion (AIC), a metric that guards against overfitting by combining the model’s goodness of fit (which almost always improves as more terms are added) with a penalty term for the number of parameters.
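Concretely, for a linear model the AIC value that step() works with can be reproduced by hand as n·log(RSS/n) + 2k, where k is the number of estimated coefficients. A minimal sketch on the built-in mtcars data (the model here is purely illustrative):

```r
# Reproduce the AIC value that step() uses for a linear model:
# n * log(RSS / n) + 2 * (number of estimated coefficients).
fit.toy <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
rss <- sum(residuals(fit.toy)^2)
k   <- length(coef(fit.toy))     # parameters estimated, including the intercept
aic.manual <- n * log(rss / n) + 2 * k
aic.manual
extractAIC(fit.toy)[2]           # step() uses extractAIC(); the values match
```

Note that extractAIC() omits additive constants that AIC() includes, so the two functions differ by a fixed offset; when comparing models fit to the same data, only differences in AIC matter.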
fit <- lm(satisfaction_level ~., data=train.data[,-7])
fit <- step(fit)
## Start: AIC=-38544.39
## satisfaction_level ~ last_evaluation + number_project + average_montly_hours +
## time_spend_company + Work_accident + promotion_last_5years +
## sales + salary
##
## Df Sum of Sq RSS AIC
## - salary 2 0.037 480.91 -38547
## <none> 480.88 -38544
## - promotion_last_5years 1 0.099 480.98 -38544
## - Work_accident 1 0.104 480.98 -38544
## - sales 9 0.994 481.87 -38538
## - average_montly_hours 1 0.526 481.40 -38533
## - last_evaluation 1 4.452 485.33 -38436
## - time_spend_company 7 11.687 492.56 -38270
## - number_project 5 181.936 662.81 -34704
##
## Step: AIC=-38547.45
## satisfaction_level ~ last_evaluation + number_project + average_montly_hours +
## time_spend_company + Work_accident + promotion_last_5years +
## sales
##
## Df Sum of Sq RSS AIC
## <none> 480.91 -38547
## - Work_accident 1 0.103 481.02 -38547
## - promotion_last_5years 1 0.108 481.02 -38547
## - sales 9 0.981 481.90 -38541
## - average_montly_hours 1 0.526 481.44 -38536
## - last_evaluation 1 4.437 485.35 -38439
## - time_spend_company 7 11.685 492.60 -38273
## - number_project 5 183.087 664.00 -34687
summary(fit)
##
## Call:
## lm(formula = satisfaction_level ~ last_evaluation + number_project +
## average_montly_hours + time_spend_company + Work_accident +
## promotion_last_5years + sales, data = train.data[, -7])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64708 -0.11936 -0.01086 0.13877 0.78835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.792e-01 1.360e-02 27.884 < 2e-16 ***
## last_evaluation 1.271e-01 1.210e-02 10.510 < 2e-16 ***
## number_project3 1.874e-01 6.319e-03 29.663 < 2e-16 ***
## number_project4 1.928e-01 6.359e-03 30.328 < 2e-16 ***
## number_project5 1.779e-01 7.062e-03 25.194 < 2e-16 ***
## number_project6 -1.954e-01 9.235e-03 -21.160 < 2e-16 ***
## number_project7 -3.483e-01 1.592e-02 -21.878 < 2e-16 ***
## average_montly_hours 1.525e-04 4.215e-05 3.619 0.000297 ***
## time_spend_company3 -1.838e-02 4.977e-03 -3.694 0.000222 ***
## time_spend_company4 -1.005e-01 6.367e-03 -15.790 < 2e-16 ***
## time_spend_company5 -4.544e-02 7.344e-03 -6.187 6.32e-10 ***
## time_spend_company6 -7.099e-02 9.396e-03 -7.555 4.48e-14 ***
## time_spend_company7 -3.974e-02 1.733e-02 -2.293 0.021838 *
## time_spend_company8 -1.003e-02 1.765e-02 -0.568 0.569926
## time_spend_company10 -3.296e-02 1.569e-02 -2.101 0.035682 *
## Work_accident1 8.406e-03 5.250e-03 1.601 0.109399
## promotion_last_5years1 2.112e-02 1.291e-02 1.637 0.101739
## saleshr 6.356e-03 1.166e-02 0.545 0.585548
## salesIT 2.481e-02 1.046e-02 2.373 0.017664 *
## salesmanagement 2.152e-02 1.235e-02 1.743 0.081423 .
## salesmarketing 2.302e-02 1.133e-02 2.032 0.042156 *
## salesproduct_mng 4.198e-02 1.114e-02 3.768 0.000165 ***
## salesRandD 1.615e-02 1.148e-02 1.407 0.159518
## salessales 3.114e-02 9.015e-03 3.454 0.000555 ***
## salessupport 2.717e-02 9.564e-03 2.841 0.004509 **
## salestechnical 2.767e-02 9.344e-03 2.961 0.003070 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2004 on 11973 degrees of freedom
## Multiple R-squared: 0.3535, Adjusted R-squared: 0.3521
## F-statistic: 261.9 on 25 and 11973 DF, p-value: < 2.2e-16
anova(fit)
## Analysis of Variance Table
##
## Response: satisfaction_level
## Df Sum Sq Mean Sq F value Pr(>F)
## last_evaluation 1 6.54 6.539 162.7988 < 2.2e-16 ***
## number_project 5 243.12 48.623 1210.5409 < 2.2e-16 ***
## average_montly_hours 1 0.20 0.203 5.0638 0.024449 *
## time_spend_company 7 11.90 1.700 42.3260 < 2.2e-16 ***
## Work_accident 1 0.12 0.120 2.9873 0.083947 .
## promotion_last_5years 1 0.09 0.088 2.1843 0.139447
## sales 9 0.98 0.109 2.7145 0.003695 **
## Residuals 11973 480.91 0.040
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model’s coefficients with significant p-values confirm the negative impact of working on 6-7 projects and of spending 4-6 years at the company, and the ANOVA table indicates number of projects had by far the most impact on the model’s fit. To test its predictive ability:
actual_pred <- test.data[,1]
fit.predicted <- predict(fit, test.data[,-1], interval="predict")
#MSE train
mean(fit$residuals^2)
## [1] 0.04007954
#MSE test
mean((test.data$satisfaction_level - fit.predicted)^2)
## [1] 0.142682
The model closely matches the training data, but fares noticeably worse on the test set, with a test mean squared error (MSE) roughly 0.1 higher than the training MSE. Let’s check some diagnostics of the fit using the ggfortify package.
library(ggfortify)
autoplot(fit)
The residuals vs. fitted plot shows that the residuals follow a roughly linear pattern, though generally skewed downwards.
The Q-Q plot shows the residuals are generally normally distributed, but taper off somewhat at the -2 and 2 quantiles.
The scale-location plot shows that the model does not have completely equal variance, judging by the curved line through the standardized residuals, which was likely caused by the three large clusters present in the data.
The residuals vs. leverage plot shows that no outliers especially affected the model.
The KNN algorithm plots the data in n-dimensional space (where n is the number of variables in the dataset) and calculates the k closest points to the current observation: whichever class is most common among those neighbors is chosen for the current observation. This makes it well suited to classification tasks, including binary prediction like this. Because KNN works only with continuous variables, we create dummy variables corresponding to each possible factor level; e.g., salary becomes three variables for low, medium, and high, each 1 if the employee falls into that category and 0 otherwise. This allows all variables to be plotted in vector space, which is how the algorithm calculates Euclidean distance between observations. We will also normalize the data so all values fall in the range 0-1 to help with the distance calculation.
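Before building the real model, the distance-and-vote step described above can be sketched in a few lines of base R; the toy matrix and labels below are purely illustrative:

```r
# Toy training data: 4 observations, 2 normalized features, binary labels
train.X <- matrix(c(0.1, 0.2,
                    0.2, 0.1,
                    0.9, 0.8,
                    0.8, 0.9), ncol = 2, byrow = TRUE)
train.y <- c(0, 0, 1, 1)
new.x   <- c(0.85, 0.85)   # a new observation to classify

# Euclidean distance from the new point to every training observation
dists <- sqrt(colSums((t(train.X) - new.x)^2))

# the k nearest neighbours vote; the majority label wins
k       <- 3
nearest <- order(dists)[1:k]
votes   <- table(train.y[nearest])
predicted <- names(votes)[which.max(votes)]
predicted   # "1", since the new point sits in the label-1 cluster
```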
library(dummies)
train.knn <- dummy.data.frame(train.data, names=c("number_project", "time_spend_company", "Work_accident", "promotion_last_5years", "sales", "salary"), sep="_")
train.labels <- select(train.knn, left)
train.knn <- select(train.knn, -left)
test.knn <- dummy.data.frame(test.data, names=c("number_project", "time_spend_company", "Work_accident", "promotion_last_5years", "sales", "salary"), sep="_")
test.knn <- select(test.knn, -left)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
train.knn <- as.data.frame(lapply(train.knn, normalize))
test.knn <- as.data.frame(lapply(test.knn, normalize))
We will first try k = 10, meaning each prediction will be compared to its 10 closest neighbors and the label that appears most will be selected.
library(class)
cl <- train.labels[,1]
pred.labels<-knn(train.knn, test.knn, cl, k=10)
conf.matrix <- table(pred.labels, true.labels)
conf.matrix
## true.labels
## pred.labels 0 1
## 0 2166 71
## 1 110 653
#misclassification rate
1 - sum(diag(conf.matrix))/sum(conf.matrix)
## [1] 0.06033333
We can also try many different values for k and plot the accuracy for each to determine the best possible model.
for (k in seq(10, 50, 5)) {
pred.labels<-knn(train.knn, test.knn, cl, k)
num.incorrect.labels<- sum(pred.labels!=true.labels)
misc.rate <- num.incorrect.labels/length(true.labels)
cat(" ",k," ", misc.rate,"\n")
}
## 10 0.059
## 15 0.06266667
## 20 0.068
## 25 0.06833333
## 30 0.06733333
## 35 0.065
## 40 0.06533333
## 45 0.06666667
## 50 0.06633333
It appears that k = 10 gives the lowest misclassification rate. We can also see the inherent variability in the algorithm: the misclassification rate for k = 10 here is slightly lower than when we fit the same model above. Such model-fitting differences can be tested and quantified through cross-validation, which we will experiment with later in this analysis.
K-means clustering is an unsupervised algorithm that tries to find natural clusters in the data, minimizing within-cluster variation while maximizing between-cluster variation. While this sort of clustering will not directly help us predict whether an employee will leave, we can check whether the clusters the algorithm finds match the ones we noticed in the earlier plots, and create new visualizations using the cluster assigned to each employee. In a similar manner to before, we can try many values of k (the number of clusters) on the same scaled data and compare them with a WSS plot. The best value of k appears at the “elbow” of the plot, where the within-group sum of squares stops dropping meaningfully as more clusters are added.
wssplot <- function(data, nc=15, seed = 1234){
wss <- (nrow(data)-1)*sum(apply(data,2,var))
for (i in 2:nc){
set.seed(seed)
wss[i] <- sum(kmeans(data, centers=i)$withinss)}
plot(1:nc, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")
}
wssplot(train.knn)
For this plot, a good value for k seems to be 4, so let’s try that.
clust <- kmeans(train.knn, 4, nstart=40)
We can create some interesting visualizations based on the clusters:
train.knn$cluster <- as.factor(clust$cluster)
ggplotly(ggplot(train.knn, aes(satisfaction_level, group = cluster, fill = cluster)) + geom_density() + facet_grid(.~cluster) + labs(x = "Satisfaction Level"))
ggplotly(ggplot(train.knn, aes(average_montly_hours, group = cluster, fill = cluster)) + geom_density() + facet_grid(.~cluster) + labs(x = "Average Monthly Hours"))
ggplotly(ggplot(train.knn, aes(satisfaction_level, fill=cluster)) + geom_histogram() + labs(x="Satisfaction Level"))
ggplotly(ggplot(train.knn, aes(average_montly_hours, fill=cluster)) + geom_histogram() + labs(x="Average Monthly Hours"))
We can see from the density plots that cluster 4 clearly represents the employees who are underworked and unsatisfied, while cluster 2 probably includes the ones who are overworked and unsatisfied. Cluster 1 seems to be the group who work average hours and are mostly satisfied, with cluster 3 containing the rest of the employees. These cluster groupings help affirm the groups we noticed previously, and the stacked histograms offer yet another way of visualizing the differences in hours and satisfaction for each group of employees, each of which the company can target more effectively once their common characteristics have been determined.
Logistic regression is a generalized linear model that, instead of predicting continuous values, maps all predictions into the range 0-1 through the logistic (sigmoid) transformation. This makes it ideal for binary classification tasks, although it can be extended to multiple classes through multinomial (or softmax) regression. Using a logistic regression model, we can predict the binary variable left directly. Let’s fit a model using the same variables as the linear model, then examine the coefficients and variable importance.
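The transformation in question is the logistic function, 1 / (1 + e^(-x)), which squashes any real-valued linear predictor into a probability; a quick sketch (the predictor values below are made up):

```r
# The logistic function maps any real number into (0, 1)
logistic <- function(eta) 1 / (1 + exp(-eta))
eta <- c(-5, -1, 0, 1, 5)          # illustrative linear-predictor values
round(logistic(eta), 4)
# R's built-in equivalent is plogis(); predict(..., type = "response")
# applies this same transformation to a glm's linear predictor
all.equal(logistic(eta), plogis(eta))
```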
library(caret)
glm_mod <-glm(left~last_evaluation + number_project + time_spend_company + promotion_last_5years + Work_accident + average_montly_hours, family=binomial(link="logit"), data=hr)
exp(glm_mod$coefficients)
## (Intercept) last_evaluation number_project3
## 1.461489e-02 6.315944e+00 3.594731e-03
## number_project4 number_project5 number_project6
## 1.259923e-02 2.424613e-02 1.070191e-01
## number_project7 time_spend_company3 time_spend_company4
## 7.528409e+06 1.853868e+01 2.827630e+01
## time_spend_company5 time_spend_company6 time_spend_company7
## 1.486829e+02 4.649765e+01 5.774585e-07
## time_spend_company8 time_spend_company10 promotion_last_5years1
## 7.940810e-07 6.697839e-07 2.774067e-01
## Work_accident1 average_montly_hours
## 2.187163e-01 1.008552e+00
varImp(glm_mod)
## Overall
## last_evaluation 8.98521829
## number_project3 39.28213778
## number_project4 42.61144562
## number_project5 35.75103852
## number_project6 18.65497628
## number_project7 0.04197069
## time_spend_company3 17.97299928
## time_spend_company4 19.01750381
## time_spend_company5 28.55046399
## time_spend_company6 20.24082592
## time_spend_company7 0.03615380
## time_spend_company8 0.03242775
## time_spend_company10 0.03775533
## promotion_last_5years1 4.11825346
## Work_accident1 14.22082307
## average_montly_hours 11.95710147
While many of the model’s coefficients don’t give us much insight, we can see the strong positive effect that working 7 projects has on whether one leaves, and the variable importance table again shows the influence that number of projects and seniority have had on the models fit so far. We can now apply the model to our test data and get its predictions. Since logistic regression outputs probabilities rather than direct binary predictions, we must set a threshold above which observations are classified as 1; it’s typical to start with 0.5.
model.predictions<-predict(glm_mod, newdata=test.data[,-7], type="response")
class.threshold<-0.5
pred.labels<-rep("no",length(true.labels))
pred.labels[model.predictions>class.threshold]="yes"
conf.matrix <- table(true.labels, pred.labels)
conf.matrix
## pred.labels
## true.labels no yes
## 0 2116 160
## 1 127 597
#misclassification rate
1 - sum(diag(conf.matrix)) / sum(conf.matrix)
## [1] 0.09566667
But what if there’s a better classification threshold for the data? Let’s try plotting prediction accuracy versus different cutoffs and determine which one maximizes accuracy, then reclassify the predictions with the new threshold.
library(ROCR)
pred<-prediction(model.predictions,true.labels)
acc <- performance(pred, "acc")
ac.val <- max(unlist(acc@y.values))
best.thresh <- unlist(acc@x.values)[unlist(acc@y.values) == ac.val]
plot(acc)
abline(v=best.thresh , col='grey', lty=2)
More concretely, we can use ~.4663 as the threshold in our new prediction.
best.thresh
## 992 639 915
## 0.4860402 0.4662581 0.4599649
class.threshold<-0.4663
pred.labels<-rep("no",length(true.labels))
pred.labels[model.predictions>class.threshold]="yes"
#confusion matrix
conf.matrix <- table(true.labels, pred.labels)
conf.matrix
## pred.labels
## true.labels no yes
## 0 2106 170
## 1 117 607
#misclassification rate
1 - sum(diag(conf.matrix)) / sum(conf.matrix)
## [1] 0.09566667
Essentially no change in this case, but it was still worth investigating. Although its predictions are less accurate than KNN’s here, logistic regression is useful because it helps us understand which variables may be significant in predicting an employee leaving. Another way to assess a logistic regression model is through an ROC curve, which plots the model’s true positive rate against its false positive rate; a successful model will maximize the area under the curve (AUC), achieving a high true positive rate at a low false positive rate. We can both plot this model’s ROC curve and calculate its precise AUC.
pred<-prediction(model.predictions,true.labels)
perf<-performance(pred,'tpr','fpr')
plot(perf,main="ROC")
abline(0,1,lty=3)
perf.auc<- performance(pred,'auc',fpr.stop=1)
auc<-slot(perf.auc,"y.values")[[1]]
auc
## [1] 0.9285018
One more approach to binary classification is Naive Bayes, which, unlike logistic regression, assumes all variables are independent (uncorrelated) and uses that assumption to build up the probability of the event it is predicting. It also converges faster than logistic regression, which can make it a better fit on a relatively small dataset such as this one, where logistic regression has the potential to overfit.
library(e1071)
bayesMod <- naiveBayes(left ~ ., data = train.data)
pred<- predict(bayesMod, as.data.frame(test.data)[,-7])
conf.matrix <- table(pred, true.labels)
conf.matrix
## true.labels
## pred 0 1
## 0 2056 220
## 1 220 504
1 - sum(diag(conf.matrix)) / sum(conf.matrix)
## [1] 0.1466667
However, Naive Bayes’ independence assumption does not hold up well here, as reflected in the model’s higher misclassification rate, and it seems fair to assume average monthly hours, number of projects, and seniority are associated. Let’s create a correlation matrix for the dataset to confirm this, first converting the factor variables back to numerics so the function works.
hrCorr <- hr
names(hrCorr) <- c("satisfac", "eval", "proj", "hours", "years","accident","left","promotion", "job", "salary")
hrCorr[,3] <- as.numeric(hrCorr[,3])
hrCorr[,5] <- as.numeric(hrCorr[,5])
hrCorr[,6] <- as.numeric(hrCorr[,6])
hrCorr[,7] <- as.numeric(hrCorr[,7])
hrCorr[,8] <- as.numeric(hrCorr[,8])
ggcorr(hrCorr, palette = "RdBu", label = TRUE) + ggtitle("Correlation Matrix")
We can see that some of the variables do indeed exhibit some correlation, so perhaps Naive Bayes is not the best choice for this prediction task.
Another model that can be used for prediction is the decision tree, which builds a tree whose interior nodes are variables in the dataset and whose leaf nodes are the predictions. Each prediction starts at the root node, then follows a path down the tree based on that observation’s value of the variable at each node (e.g., if salary is high, go left, else go right). This makes the model highly interpretable, since its decision-making process is completely transparent and easy to visualize. The optimal tree partitions the variables in a way that maximizes information gain (the reduction in entropy), a measure of goodness of predictions based on the “pureness” of each predicted class. A binary decision tree, for example, wants each leaf node to contain as many 1s or as many 0s as possible, without mixing the two. For multiclass prediction tasks or regression trees, it wants to group as many similar observations together as possible, although this is much harder to fully achieve than with only two prediction options. We can try using a regression tree to predict satisfaction level with the same variables used in the models above. A model was initially fit with every variable, but since only the most informative variables actually appear in the tree, we end up with much the same set of predictors as before.
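The purity idea can be made concrete with a small sketch: compute the entropy of a node’s class proportions, then the information gained by a candidate split (the labels below are made up; rpart’s classification trees actually default to the closely related Gini index, but the intuition is the same):

```r
# Entropy of a vector of class labels, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

parent <- c(1, 1, 1, 1, 0, 0, 0, 0)   # perfectly mixed node: entropy = 1 bit
left   <- c(1, 1, 1, 1)               # pure child: entropy = 0
right  <- c(0, 0, 0, 0)               # pure child: entropy = 0

# information gain = parent entropy minus the weighted child entropy
child.entropy <- (length(left) * entropy(left) +
                  length(right) * entropy(right)) / length(parent)
info.gain <- entropy(parent) - child.entropy
info.gain   # 1: a perfectly separating split recovers the full bit
```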
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
#choose variables that are actually used in model
tree1 <- rpart(satisfaction_level ~ last_evaluation + number_project + average_montly_hours, method="anova", data=train.data)
fancyRpartPlot(tree1)
varImp(tree1)
## Overall
## average_montly_hours 0.5020711
## last_evaluation 0.5185606
## number_project 1.0891462
From the plot, it’s easy to identify factors that lead to lower satisfaction levels, including working on 2, 6, or 7 projects and working more than 244 hours a month, and the tree’s variable importance indicates that number of projects is the strongest predictor in the model. Because decision trees have a tendency to overfit the given dataset, we can specify a complexity parameter (cp) that limits how large the tree grows; this plays a similar role to the step function used earlier in linear regression. We can try finding the optimal cp, then pruning the tree and visualizing it again.
printcp(tree1)
##
## Regression tree:
## rpart(formula = satisfaction_level ~ last_evaluation + number_project +
## average_montly_hours, data = train.data, method = "anova")
##
## Variables actually used in tree construction:
## [1] average_montly_hours last_evaluation number_project
##
## Root node error: 743.86/11999 = 0.061994
##
## n= 11999
##
## CP nsplit rel error xerror xstd
## 1 0.271205 0 1.00000 1.00021 0.0105449
## 2 0.052959 1 0.72880 0.72923 0.0098389
## 3 0.038713 2 0.67584 0.68115 0.0102984
## 4 0.010960 3 0.63712 0.63999 0.0095441
## 5 0.010405 4 0.62616 0.62827 0.0094920
## 6 0.010000 5 0.61576 0.61859 0.0094055
plotcp(tree1)
An optimal cp seems to be 0.01, which minimizes the cross-validated relative error of the tree, so we can try using that. However, the pruned model is unchanged once plotted, so the best cp was evidently already in use.
tree2 <- prune(tree1, cp = 0.01)
fancyRpartPlot(tree2)
The last step is to try predicting satisfaction level with the tree. We end up with a test MSE of ~0.039, noticeably better than the linear model but still with room for improvement.
tree1.pred <- predict(tree1, newdata = test.data[,-1], type = "vector")
mean((test.data$satisfaction_level - tree1.pred)^2)
## [1] 0.03883447
Now we can use a similar approach but this time use a classification tree to predict left.
tree3 <- rpart(left ~ last_evaluation + number_project + average_montly_hours + satisfaction_level, data=train.data)
prp(tree3)
varImp(tree3)
## Overall
## average_montly_hours 1234.829
## last_evaluation 1021.550
## number_project 2236.070
## satisfaction_level 2952.346
We can visualize the factors that may lead an employee to leave the company, and see that satisfaction level and number of projects are the strongest predictors of whether an employee leaves. We can also try altering the model with a different cp, although again the pruned model is no different from the model already fitted.
printcp(tree3)
##
## Classification tree:
## rpart(formula = left ~ last_evaluation + number_project + average_montly_hours +
## satisfaction_level, data = train.data)
##
## Variables actually used in tree construction:
## [1] average_montly_hours last_evaluation number_project
## [4] satisfaction_level
##
## Root node error: 2847/11999 = 0.23727
##
## n= 11999
##
## CP nsplit rel error xerror xstd
## 1 0.318581 0 1.00000 1.00000 0.0163679
## 2 0.216368 1 0.68142 0.68142 0.0141651
## 3 0.029680 2 0.46505 0.46505 0.0120550
## 4 0.027924 6 0.34633 0.31507 0.0101190
## 5 0.025290 10 0.23463 0.24271 0.0089634
## 6 0.012996 11 0.20934 0.21602 0.0084845
## 7 0.010000 12 0.19635 0.20091 0.0081979
plotcp(tree3)
tree4 <- prune(tree3, cp = 0.01)
prp(tree4)
To try some predictions with the classification tree:
tree3.pred <- predict(tree3, newdata = test.data[,-7], type = "class")
conf.mat <- table(tree3.pred, test.data[,7])
conf.mat
##
## tree3.pred 0 1
## 0 2220 79
## 1 56 645
1 - sum(diag(conf.mat)) / sum(conf.mat)
## [1] 0.045
We are able to get a misclassification rate of ~0.045, which is promising but could still be lower.
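As a quick sketch (the metric names here are my own labels, not output from the code above), the confusion matrix also lets us compute precision and recall for the leavers, which can matter more than raw accuracy since only about a quarter of employees left:

```r
# Rows of conf.mat are predictions, columns are actual labels; "1" = left.
precision <- conf.mat["1", "1"] / sum(conf.mat["1", ])  # of predicted leavers, fraction who actually left
recall    <- conf.mat["1", "1"] / sum(conf.mat[, "1"])  # of actual leavers, fraction we caught
c(precision = precision, recall = recall)
```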
Random forests attempt to counter decision trees’ tendency to overfit, and their potential for lower accuracy, by aggregating the results of hundreds or thousands of decision trees at once. One such method is known as bagging (short for bootstrap aggregation), in which the data is sampled with replacement multiple times; each sample is modeled by an unpruned tree and tested on the out-of-sample data (similar in spirit to cross-validation of models). The results of the many trees are then averaged into a single model. A random forest reduces to bagging when the number of variables considered at each split (mtry) equals the total number of predictors in the dataset; in our case, that means mtry = 9.
library(randomForest)
bag.hr<-randomForest(left~., data=train.data, mtry=9, importance=TRUE)
A nice thing about random forests is that they offer an easy way to visualize variable importance. Aside from the mean decrease in accuracy when a variable is excluded from the model, we can also judge a variable’s influence by its mean decrease in Gini impurity, a measure of how much splitting on that variable turns mixed-label nodes into single-label nodes (thus reducing their impurity). These plots indicate again that satisfaction level and number of projects have the highest influence on employees leaving in the model.
importance(bag.hr)
## 0 1 MeanDecreaseAccuracy
## satisfaction_level 133.511795 422.95511 374.31536
## last_evaluation 38.819483 175.79269 173.86955
## number_project 89.352999 385.19855 348.02278
## average_montly_hours 128.848683 146.22947 170.95132
## time_spend_company 75.542969 228.66860 237.97403
## Work_accident 4.870079 18.43369 18.50575
## promotion_last_5years -1.171823 22.86968 11.93556
## sales 9.056858 81.63699 51.56396
## salary 15.715483 36.21960 35.96720
## MeanDecreaseGini
## satisfaction_level 1399.2542275
## last_evaluation 322.4152640
## number_project 1451.8773439
## average_montly_hours 311.7044033
## time_spend_company 790.3510630
## Work_accident 3.1738619
## promotion_last_5years 0.8109428
## sales 49.3051031
## salary 14.1709057
varImpPlot(bag.hr)
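To make the Gini numbers concrete, here is a minimal sketch of the impurity measure itself (the helper function is mine, for illustration only):

```r
# Gini impurity for a vector of class proportions p: 1 - sum(p^2).
# It is 0 for a pure node and maximal when classes are evenly mixed,
# so good splits produce large decreases in Gini.
gini <- function(p) 1 - sum(p^2)
gini(c(1, 0))      # pure node: 0
gini(c(0.5, 0.5))  # evenly mixed node: 0.5
```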
Now for some predictions:
bag.pred<-predict(bag.hr, newdata=test.data)
sum(bag.pred!=true.labels) / length(true.labels)
## [1] 0.009666667
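As an aside, randomForest also reports an out-of-bag (OOB) error estimate essentially for free, since each tree can be tested on the observations left out of its bootstrap sample; we would expect it to roughly agree with the test error above:

```r
# err.rate is a matrix of running error estimates, one row per tree;
# the "OOB" column of the last row is the final out-of-bag error.
tail(bag.hr$err.rate[, "OOB"], 1)
```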
Only about a 1 percent misclassification rate! While this may be hard to improve on, we can try another tree ensemble technique known as boosting, where trees are grown sequentially, each using error information from previous iterations to “boost” the current tree. Let’s try predicting satisfaction level using 5,000 trees.
library(gbm)
boost.hr.sat<-gbm(satisfaction_level~., data=train.data[,-7], n.trees=5000, interaction.depth=4, distribution = "gaussian")
summary(boost.hr.sat)
## var rel.inf
## number_project number_project 67.383583057
## average_montly_hours average_montly_hours 17.704702381
## time_spend_company time_spend_company 8.435875366
## last_evaluation last_evaluation 5.771458378
## sales sales 0.659240471
## salary salary 0.028225714
## Work_accident Work_accident 0.010081661
## promotion_last_5years promotion_last_5years 0.006832972
For predictions:
boost.pred.sat<-predict(boost.hr.sat, newdata=test.data[,-1], n.trees = 5000)
mean((boost.pred.sat-test.data[,1])^2)
## [1] 0.03521229
We can see that the boosted model predicting satisfaction level gives us our lowest MSE yet of ~0.035.
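One parameter we fixed by hand is n.trees. As a sketch (the cv.folds value here is an assumption, not something tuned above), gbm can cross-validate internally and report the optimal number of trees via gbm.perf, which guards against boosting for too many iterations:

```r
# Refit with internal 5-fold cross-validation; parameters mirror the model above.
boost.cv <- gbm(satisfaction_level ~ ., data = train.data[,-7], n.trees = 5000,
                interaction.depth = 4, distribution = "gaussian", cv.folds = 5)
# Optimal iteration count according to the CV error curve.
best.iter <- gbm.perf(boost.cv, method = "cv")
```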
Now that we’ve tried a multitude of different modeling approaches, we can pick the best one (in the case of predicting left, the bagged random forest) and fit it multiple times over different subsets of the data to see how its accuracy fluctuates across training sets.
data.set<-hr
data.size<-nrow(data.set)
data.cols<-ncol(data.set)
num.folds<-10
# add the fold column to the data set
data.set["fold"]<-floor(runif(data.size)*num.folds)+1
data.set$fold<-factor(data.set$fold)
#collect each iteration's misclassification rate
misclassification.rates<-c()
for(i in c(1:num.folds)){
  # create test and train sets for this fold, excluding the fold column
  train <- data.set[(data.set$fold != i), 1:data.cols]
  test <- data.set[(data.set$fold == i), 1:data.cols]
  # fit model on the training folds
  bag.hr <- randomForest(left ~ ., data = train, mtry = 9, importance = TRUE)
  # predict on this fold's held-out observations
  bag.pred <- predict(bag.hr, newdata = test)
  # calculate misclassification rate and append
  misc.rate <- sum(bag.pred != test$left) / nrow(test)
  misclassification.rates <- c(misclassification.rates, misc.rate)
}
plot(misclassification.rates, xlab="Fold", ylab="Misclassification Rate")
lines(misclassification.rates)
summary(misclassification.rates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0003333 0.0004167 0.0008333 0.0008333 0.0012500 0.0013330
sd(misclassification.rates)
## [1] 0.0004230985
hist(misclassification.rates)
The misclassification rate stays low across every fold, and the low standard deviation shows that the model does not risk wildly different prediction rates when fit on different subsets of the data, a major benefit if the model were ever to be improved upon and put into production by the company.
In this post, we were able to explore this dataset, create visualizations that highlight what could be causing employees to leave, and fit and test many different models, whose accuracies are compiled below. Ultimately, the random forest models performed best in predicting both variables. A general trend among the models was the impact that number of projects and seniority had on the models’ predictions and coefficients, two variables we examined further in the visualizations section of the post. However, given that the employees with many projects came almost entirely from those with 4-5 years of seniority, it’s safe to assume that the high influence of those factors comes from the project load and satisfaction of these employees, not their seniority itself. This leaves number of projects as the top variable in predicting an employee’s satisfaction level and whether they will leave the company or not.
| Method | Variable Predicted | Test Error (MSE / Misclassification Rate) |
|---|---|---|
| Linear Regression | satisfaction_level | 0.143 |
| KNN | left | 0.059 |
| Logistic Regression | left | 0.095 |
| Naive Bayes | left | 0.147 |
| Decision Tree | satisfaction_level | 0.039 |
| Decision Tree | left | 0.045 |
| Random Forest (Bagged) | left | 0.010 |
| Boosted Trees (gbm) | satisfaction_level | 0.034 |
Through the analysis we carried out, three main groups of leaving employees were identified: those who are underutilized and unhappy, those who are overworked and unhappy, and those who work heavily but stay satisfied with their jobs. I suggest that the company focus on groups 1 and 2, rather than on group 3, whose members more likely leave for external reasons outside the company’s control. After a broad review of its project planning and scheduling procedures, the company can address each group of unsatisfied employees by distributing projects and hours more evenly between the two groups.
Employees in group 1 should be given either several small projects or fewer, more substantial projects based on each employee’s preference, and they should be offered the chance to pick up more hours on an employee-by-employee basis. The overutilized group 2 can be given a smaller number of meaningful projects to occupy their time, rather than the clearly excessive load they are being assigned now. If the company has trouble restructuring toward this more suitable work balance for the sake of its employees, I suspect there are greater problems regarding its employment policies and managerial expectations than a data scientist can solve.